Abstract

The random initialization of the weights of a multilayer perceptron makes
it possible to model its training process as a Las Vegas algorithm, i.e. a
randomized algorithm which stops when some required training error is
obtained, and whose execution time is a random variable. This modeling
is used to perform a case study on a well-known pattern recognition
benchmark: the UCI Thyroid Disease Database. Empirical evidence is
presented that the training time probability distribution exhibits
heavy-tail behavior, meaning a large probability mass of long executions.
This fact is exploited to reduce the training time cost by applying two
simple restart strategies. The first assumes full knowledge of the
distribution, yielding a 40% reduction in expected time with respect to
training without restarts. The second assumes null knowledge, yielding a
reduction ranging from 9% to 23%.
1 Introduction

The training time of a Multilayer Perceptron (MLP), understood as the time
needed to obtain some required training error, is a random variable which
depends on the random initialization of the MLP weights.

These weights are commonly initialized according to a given probability
distribution, and this choice has a significant impact on the training time
distribution (see Delashmit & Manry 2002, Duch et al. 1997, LeCun et al.
1998). To address this problem, some weight initialization methods have been
proposed (e.g. Duch et al. 1997, Weymaere & Martens 1994). They attempt to
reduce the training time by applying different probability distributions to
the initial weights of the MLP based on knowledge about the training set.

In this correspondence, we present a simpler and more general approach which
does not make use of such information. To do this, we model the learning
process of an MLP as a Las Vegas algorithm (Luby et al. 1993), i.e. a
randomized algorithm which meets three conditions: (i) it stops when some
pre-defined training error $\delta$ is obtained, (ii) its only measurable
observation is the training time, and (iii) it has either full or null
knowledge about the training time probability distribution.

Using this modeling, we perform a case study with the UCI Thyroid Disease
database¹, revealing that the time distribution for learning this pattern
recognition benchmark belongs to the heavy-tail distribution family.
Distributions of this type are regarded as non-standard because of their
large probability mass at arbitrarily long values.

We make use of formal and experimental results which prove that the
expected execution time of a randomized algorithm with such an underlying
distribution can be reduced by using restart strategies (Gomes 2003). This
work adapts these strategies to the MLP context: the MLP is trained for a
number of epochs $t_1$. If the required training error $\delta$ is achieved
before $t_1$, the execution finishes. Otherwise, we re-initialize the weights
randomly and re-train the MLP for $t_2$ epochs. The process is repeated until
the training error $\delta$ is reached, $t_i$ being the restart threshold (in
epochs) after $i - 1$ restarts have been performed.
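
The restart procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `run_once` is a hypothetical callback standing in for one MLP training run from fresh random weights, and `toy_run` replaces real training with a draw from an assumed Pareto distribution purely so that the sketch can be executed.

```python
import itertools
import random
from typing import Callable, Iterable, Tuple

# Hypothetical signature: run_once(budget, rng) trains from fresh random
# weights for at most `budget` epochs and returns (reached_target, epochs_used).
RunOnce = Callable[[int, random.Random], Tuple[bool, int]]


def train_with_restarts(run_once: RunOnce, thresholds: Iterable[int],
                        rng: random.Random) -> Tuple[int, int]:
    """Return (total_epochs, restarts) once some run reaches the target error."""
    total_epochs = 0
    for restarts, t_i in enumerate(thresholds):
        reached, used = run_once(t_i, rng)  # one run with restart threshold t_i
        total_epochs += used
        if reached:
            return total_epochs, restarts
    raise RuntimeError("thresholds exhausted before the target error was reached")


def toy_run(budget: int, rng: random.Random) -> Tuple[bool, int]:
    """Stand-in for MLP training (assumption): the number of epochs needed is
    drawn from a heavy-tailed Pareto distribution instead of being measured."""
    needed = int(100 * rng.paretovariate(1.5))
    return needed <= budget, min(needed, budget)


# Fixed threshold of 418 epochs for every run (t_i = 418 for all i).
total, restarts = train_with_restarts(toy_run, itertools.repeat(418),
                                      random.Random(0))
print(total, restarts)
```

Passing the thresholds as an iterable keeps the loop independent of the strategy: a constant sequence gives a fixed-threshold strategy, while a growing sequence gives a geometric one.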

Two different strategies are applied to determine the restart thresholds.
The first assumes full knowledge of the distribution, yielding a 40%
reduction in expected time with respect to training without restarts. The
second assumes null knowledge, yielding a reduction ranging from 9% to 23%.

The rest of the paper is organized as follows. Section 2 presents the Thyroid
Disease database and provides evidence of heavy-tail behavior when an MLP is
trained on it. Section 3 tests the condition that the probability
distribution must satisfy to profit from restart strategies, and provides an
empirical evaluation of two strategies on the particular case study. Finally,
some conclusions and future research lines are given in Section 4.

2 A case study: the UCI Thyroid Disease Database 

To motivate the use of restarts in MLP learning, we first show that its
training time is highly variable, which is indicative of underlying
heavy-tail behavior. The evaluation was performed using the UCI Thyroid
Disease database as a case study.

Table 1 shows the expectation and deviation (and their ratio) of the number
of epochs T spent in building a single-hidden-layer MLP with n = 1, ..., 8
units. The MLP was trained using the well-known Back-Propagation technique
with a target training error $\delta$ = 0.02. The results shown were computed
using 10-fold cross validation.

n                1       2       3       4       5       6       7       8
E[T]        8551.7  5516.8   888.5  2339.7  1680.2   587.6   482.4   490.5
σ[T]        2547.5  3885.6  1565.5  2848.8  1355.6    55.1   296.9   464.1
σ[T]/E[T]      30%     70%    156%    106%     79%     10%     60%     95%

Table 1: Expectation, deviation (and their ratio) of the number of epochs T
spent in building an MLP with n hidden units and training error $\delta$ =
0.02. The training algorithm was run 1,000 times for each number of hidden
units.

¹ The UCI Repository of Machine Learning Databases, available online at
http://www.ics.uci.edu/~mlearn/MLRepository.html

The obtained deviations are very large relative to the expectations for most
of the architectures. For the rest of the experiments, we shall use an MLP
with n = 3 hidden units, which has the highest relative variability. This
will serve as a proof of concept, although the same behavior is observed in
MLPs with other numbers of hidden units.

In the following, we give visual evidence that T is heavy tailed, i.e. that
the probability of the training time T being greater than some number of
epochs t has polynomial decay, viz. $P[T > t] \sim C t^{-\alpha}$, where
$\alpha \in (0, 2)$, C is some constant, and t > 0.

Figure 1 presents a log-log plot of $P[T > t]$ for the 10% largest values
(t > 3,000). The plot confirms the polynomial decay by displaying a straight
line with slope $-\alpha$. This is because, for sufficiently large t,
$\log P[T > t] \approx \log C - \alpha \log t$, so that
$\log P[T > t] / \log t \approx -\alpha$.
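
As a rough illustration of how such a log-log tail plot and its slope could be computed from a sample of training times, consider the sketch below. The function names and the least-squares fit are assumptions made for this example; the text does not state how the straight line was assessed beyond visual inspection.

```python
import math
from typing import List, Sequence, Tuple


def tail_survival_points(times: Sequence[float],
                         tail_fraction: float = 0.1) -> List[Tuple[float, float]]:
    """Return (log t, log P[T > t]) for the largest `tail_fraction` of samples."""
    ordered = sorted(times)
    m = len(ordered)
    start = m - int(tail_fraction * m)
    points = []
    # The largest sample is skipped: its empirical survival probability is 0.
    for k in range(start, m - 1):
        # Fraction of samples strictly above ordered[k], assuming no ties.
        survival = (m - k - 1) / m
        points.append((math.log(ordered[k]), math.log(survival)))
    return points


def least_squares_slope(points: Sequence[Tuple[float, float]]) -> float:
    """Ordinary least-squares slope of y on x for the given (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

For a sample with polynomial tail decay, the returned slope approximates the negated tail index.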

Finally, we verify that $\alpha$ belongs to the interval (0, 2) by computing
Hill's (1975) estimator:

$$\hat{\alpha}_r = \left( \frac{1}{r} \sum_{j=1}^{r} \ln T_{m,m-j+1} - \ln T_{m,m-r} \right)^{-1},$$

where $T_{m,1} \le T_{m,2} \le \dots \le T_{m,m}$ are the m ordered training
completion times, and r < m is a cutoff that restricts the estimate to the
highest values (the tail). We use the typical cutoff r = 0.1m and obtain
$\hat{\alpha}_r = 1.942$, which is consistent with our hypothesis.
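
A minimal sketch of the Hill estimator, assuming the standard form with a cutoff equal to a fraction of the sample size, is given below. The function name and the `cutoff_fraction` parameter are illustrative choices, and the times are assumed to be strictly positive.

```python
import math
from typing import Sequence


def hill_estimator(times: Sequence[float], cutoff_fraction: float = 0.1) -> float:
    """Estimate the tail index from the r = cutoff_fraction * m largest times."""
    ordered = sorted(times)
    m = len(ordered)
    r = int(cutoff_fraction * m)
    if not 0 < r < m:
        raise ValueError("the cutoff must satisfy 0 < r < m")
    threshold = ordered[m - r - 1]  # T_{m,m-r} in 1-based notation
    mean_log_excess = sum(
        math.log(ordered[m - j] / threshold)  # ln T_{m,m-j+1} - ln T_{m,m-r}
        for j in range(1, r + 1)
    ) / r
    if mean_log_excess == 0:
        raise ValueError("all tail samples are equal; the estimator is undefined")
    return 1.0 / mean_log_excess
```

An estimate inside (0, 2) is what the heavy-tail hypothesis requires; values close to the boundary, such as the 1.942 reported here, depend noticeably on the choice of cutoff.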

This polynomial decay, which yields a large probability mass for long
executions, arises because certain initial weights lead to convergence to
local minima of the target function, requiring very long (even infinite)
training periods, while others lead to convergence to global minima in a few
epochs.

3 Restart strategies 

A Las Vegas algorithm may profit from restarting if, at some point $\tau$ of
the execution, the expected remaining completion time conditioned on the
execution time already spent, $E[T - \tau \mid T > \tau]$, is larger than the
(unconditioned) expected completion time $E[T]$, i.e. if $\exists \tau$ such
that $E[T] < E[T - \tau \mid T > \tau]$ (see van Moorsel & Wolter 2004).

Figure 1: A log-log plot of $P[T > t]$ as a function of t (in epochs).

Figure 2 shows that most values of $\tau$ satisfy this condition, so the MLP
can profit from restart strategies.
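
The restart condition can be checked empirically from a sample of observed training times. The sketch below is one way to do so; the function names are hypothetical, and expectations are replaced by sample means.

```python
from typing import Sequence


def mean_residual_time(times: Sequence[float], tau: float) -> float:
    """Sample estimate of the expected remaining time given that the run has
    already lasted longer than tau."""
    residuals = [t - tau for t in times if t > tau]
    if not residuals:
        raise ValueError("no run in the sample lasted longer than tau")
    return sum(residuals) / len(residuals)


def restart_pays_off(times: Sequence[float], tau: float) -> bool:
    """True if the expected remaining time after tau exceeds the expected time
    of a fresh run, i.e. if restarting at tau is expected to help."""
    expected_time = sum(times) / len(times)
    return mean_residual_time(times, tau) > expected_time
```

For example, with nine runs of 10 epochs and one run of 1,000 epochs, the mean time is 109 epochs, while a run that is still unfinished after 10 epochs needs 990 more on average, so restarting at that point is worthwhile.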

3.1 Restart strategies when the distribution is known 

Luby et al. (1993) prove the existence of an optimal restart strategy for a
Las Vegas algorithm which minimizes the expected running time when the
execution time distribution $q(t) = \Pr(T \le t)$ is assumed known.

This optimal strategy uses a fixed restart threshold for all iterations, of
the form $t_i = t^*$, where

$$t^* = \arg\min_t E[S_t] = \arg\min_t \frac{1}{q(t)} \left( t - \sum_{t' < t} q(t') \right) \qquad (1)$$

and $S_t$ is the restart strategy where $t_i = t$ $\forall i$ for some t. We
assume some discretization of the time, so that expressions like t' < t make
sense.
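
The minimization in (1) can be sketched for integer epoch counts as below, with the known distribution replaced by the empirical distribution of a sample. That substitution, the function name and the `max_t` search bound are assumptions made for this illustration.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def optimal_fixed_threshold(times: Sequence[int], max_t: int) -> Tuple[int, float]:
    """Return (t*, E[S_t*]) minimizing (t - sum_{t' < t} q(t')) / q(t) over
    t = 1..max_t, where q is the empirical CDF of `times`."""
    ordered = sorted(times)
    m = len(ordered)
    best_t, best_cost = 0, math.inf
    cumulative_q = 0.0  # running value of sum_{t' < t} q(t')
    for t in range(1, max_t + 1):
        cumulative_q += bisect_right(ordered, t - 1) / m  # adds q(t - 1)
        q_t = bisect_right(ordered, t) / m                # q(t) = Pr(T <= t)
        if q_t > 0:
            cost = (t - cumulative_q) / q_t
            if cost < best_cost:
                best_t, best_cost = t, cost
    if best_t == 0:
        raise ValueError("no sampled time is <= max_t")
    return best_t, best_cost
```

With nine runs of 10 epochs and one run of 1,000 epochs, the best threshold is 10 epochs, with an expected cost of about 11.1 epochs against a mean of 109 without restarts.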

Simple calculations yield $t^* = 418$, with an optimal expected time
$E[S_{t^*}] = 546.876$. This provides a 40% reduction in expected time with
respect to training without restarts (see Table 1). Figure 3 displays the
expected time for strategies of the form $S_t$ with $t \in [100, 10{,}000]$.
As can be seen, many non-optimal choices of t provide a time reduction as
well.

Figure 2: $E[T - \tau \mid T > \tau]$ as a function of $\tau$; $E[T]$ serves
as the baseline.

3.2 Restart strategy when the time distribution is unknown

In some scenarios it is not possible to assume full knowledge of the distribution, 
e.g. if the MLP is to be trained a single time. In this subsection we assume null 
knowledge. 

Again, Luby et al. (1993) prove the existence of an optimal strategy under
this assumption, and Walsh (1999) derives a simpler variant which is commonly
used in practical applications. The Walsh strategy $S_w$ is defined as
$t_i = \gamma^{i-1}$, $\gamma > 1$. This strategy benefits from a high
probability of success when $t_i = \gamma^{i-1}$ is near $t^*$. Increasing
$t_i$ geometrically ensures that $t^*$ is reached in a few iterations, and
error $\delta$ is expected to be reached within a few restarts after the
value of $t_i$ surpasses the optimal threshold.
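
The geometric threshold sequence and its effect on the total training time can be sketched as follows. Resampling from a list of previously observed training times is an assumption used here only to simulate runs cheaply; `walsh_thresholds` and `simulate_total_time` are hypothetical names.

```python
import itertools
import random
from typing import Iterable, Iterator, Sequence


def walsh_thresholds(gamma: float) -> Iterator[float]:
    """Yield the geometric restart thresholds 1, gamma, gamma^2, ..."""
    if gamma <= 1:
        raise ValueError("gamma must be greater than 1")
    t = 1.0
    while True:
        yield t
        t *= gamma


def simulate_total_time(times: Sequence[float], thresholds: Iterable[float],
                        rng: random.Random) -> float:
    """Total epochs spent until a run finishes within its threshold. Each run
    is simulated by drawing one completion time from `times` (assumption)."""
    total = 0.0
    for t_i in thresholds:
        run = rng.choice(times)
        if run <= t_i:
            return total + run  # this run reaches the target error
        total += t_i            # run abandoned after t_i epochs
    raise RuntimeError("thresholds exhausted before any run finished")


print(list(itertools.islice(walsh_thresholds(2), 5)))
```

Averaging `simulate_total_time` over many repetitions gives an estimate of the expected cost of the strategy for a given value of the growth factor, without any knowledge of the optimal fixed threshold.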

Figure 4 displays the expected values of $S_w$ for several standard values of
$\gamma$, $\gamma = 2, 3, \dots, 10$. Training is sped up with all choices,
with improvements ranging from 9% ($\gamma = 2$) to 23% ($\gamma = 8$). The
expected times were computed by running the training algorithm 1,000 times
for each $\gamma$.
Figure 3: Expected training time using the strategy $S_t$ with
$t \in [100, 10{,}000]$; $E[T]$ serves as the baseline.

4 Conclusions and future work 

In this work, the MLP training algorithm is modeled as a Las Vegas algorithm,
and a case study is performed on the UCI Thyroid Disease Database. We give
visual and numerical evidence that the probability distribution of the
training time belongs to the heavy-tail family, meaning a polynomial
probability decay for long executions. This property is exploited to reduce
the training time cost by means of two simple strategies. The first assumes
full knowledge of the distribution, yielding a 40% reduction in expected time
with respect to training without restarts. The second assumes null knowledge,
yielding a reduction ranging from 9% to 23%.

As future research, we plan to determine whether further improvements can be
obtained by relaxing assumptions (ii) and (iii) of Las Vegas algorithms (see
Section 1). This could make it possible to incorporate dynamic restart
strategies (see Kautz et al. 2002) capable of exploiting epoch-by-epoch
information about the training time distribution, using various measurements
of the algorithm's behavior besides the execution time.
Figure 4: Expected training time $E[S_w]$ using the Walsh strategy for
$\gamma = 1, 2, \dots$; $E[S_{t^*}]$ and $E[T]$ serve as baselines.